Abstract

We theoretically and experimentally investigate tensor-based regression and classification.
Our focus is regularization with various tensor norms, including the
overlapped trace norm, the latent trace norm, and the scaled latent trace norm. 
We first give dual optimization methods using the alternating direction method of 
multipliers, which is computationally efficient when the number of training samples 
is moderate. We then theoretically derive an excess risk bound for each tensor 
norm and clarify their behavior. Finally, we perform extensive experiments using 
simulated and real data and demonstrate the superiority of tensor-based learning 
methods over vector- and matrix-based learning methods. 


1 Introduction 


A wide range of real-world data takes the format of matrices and tensors, e.g.,
recommendation (Karatzoglou et al., 2010), video sequences (Kim et al., 2007), climates
(Bahadori et al., 2014), genomes (Sankaranarayanan et al., 2015), and neuro-imaging
(Zhou et al., 2013). A naive way to learn from such matrix and tensor data is to vectorize
them and apply ordinary regression or classification methods designed for vectorial data.
However, such a vectorization approach would lead to loss of structural information of
matrices and tensors such as low-rankness.

The objective of this paper is to investigate regression and classification methods that
directly handle tensor data without vectorization. Low-rank structure of data has been
successfully utilized in various applications such as missing data imputation (Cai et al.,
2010), robust principal component analysis (Candès et al., 2011), and subspace clustering
(Liu et al., 2010). In this paper, instead of low-rankness of data itself, we consider
its dual, namely the learning coefficients of a regressor and a classifier. Low-rankness in
learning coefficients means that only a subspace of feature space is used for regression and
classification.

For matrices, regression and classification have been studied in Tomioka and Aihara
(2007) and Zhou and Li (2014) in the context of EEG data analysis. It was experimentally
demonstrated that directly learning matrix data by low-rank regularization can
significantly improve the performance compared to learning after vectorization. Another
advantage of using low-rank regularization in the context of EEG data analysis is that
analyzing singular value spectra of learning coefficients is useful in understanding activities
of brain regions.

More recently, an inductive learning method for tensors has been explored
(Signoretto et al., 2013). Compared to the matrix case, learning with tensors is inherently
more complex. For example, the multilinear ranks of tensors make it more complicated
to find a proper notion of low-rankness for a tensor than for a matrix, which has only one rank.
So far, several tensor norms such as the overlapped trace norm or the tensor nuclear norm
(Liu et al., 2009), the latent trace norm (Tomioka and Suzuki, 2013), and the scaled
latent trace norm (Wimalawarne et al., 2014) have been proposed and demonstrated to
perform well for various tensor structures. However, theoretical analysis of tensor learning
in inductive learning settings has not been much investigated yet. Another challenge
in inductive tensor learning is efficient optimization strategies, since tensor data often
have much higher dimensionalities than matrix and vector data.

In this paper, we theoretically and experimentally investigate tensor-based regression 
and classification with regularization by the overlapped trace norm, the latent trace norm, 
and the scaled latent trace norm. We first provide their dual formulations and propose
optimization procedures using the alternating direction method of multipliers (Bertsekas,
1996), which is computationally efficient when the number of data samples is moderate.
We then derive an excess risk bound for each tensor regularization, which allows us to 
theoretically understand the behavior of tensor norm regularization. More specifically, 
we elucidate that the excess risk of the overlapped trace norm is bounded with the 
average multilinear ranks of each mode, that of the latent trace norm is bounded with 
the minimum multilinear rank among all modes, and that of the scaled latent trace norm 
is bounded with the minimum ratio between multilinear ranks and mode dimensions. 
Finally, for simulated and real tensor data, we experimentally investigate the behavior 
of tensor-based regression and classification methods. The experimental results are in 
concordance with our theoretical findings, and tensor-based learning methods compare 
favorably with vector- and matrix-based methods. 
The remainder of the paper is organized as follows. In Section 2, we formulate the
problem of tensor-based supervised learning and review the overlapped trace norm, the
latent trace norm, and the scaled latent trace norm. In Section 3, we derive dual optimization
algorithms based on the alternating direction method of multipliers. In Section 4,
we theoretically give an excess risk bound for each tensor norm. In Section 5, we give
experimental results on both artificial and real-world data and illustrate the advantage
of tensor-based learning methods. Finally, in Section 6, we conclude this paper.


Notation 


Throughout the paper, we use standard tensor notation following Kolda and Bader
(2009). We represent a $K$-way tensor as $\mathcal{W} \in \mathbb{R}^{n_1 \times \cdots \times n_K}$
that consists of $N = \prod_{k=1}^{K} n_k$ elements. A mode-$k$ fiber of $\mathcal{W}$ is an
$n_k$-dimensional vector which can be obtained by fixing all except the $k$th index. The
mode-$k$ unfolding of tensor $\mathcal{W}$ is represented as
$W_{(k)} \in \mathbb{R}^{n_k \times N/n_k}$, which is obtained by concatenating all the $N/n_k$
mode-$k$ fibers along its columns. The spectral norm of a matrix $X$ is denoted by
$\|X\|_{\mathrm{op}}$, which is the maximum singular value of $X$. The operator
$\langle \mathcal{W}, \mathcal{X} \rangle$ is the sum of element-wise multiplications of
$\mathcal{W}$ and $\mathcal{X}$, i.e.,
$\langle \mathcal{W}, \mathcal{X} \rangle = \mathrm{vec}(\mathcal{W})^\top \mathrm{vec}(\mathcal{X})$.
The Frobenius norm of a tensor $\mathcal{X}$ is defined as
$\|\mathcal{X}\|_F = \sqrt{\langle \mathcal{X}, \mathcal{X} \rangle}$.
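The notation above can be sketched in NumPy as follows. This is a minimal illustration, not code from the paper: the helper names `unfold` and `inner` are hypothetical, and the column ordering of this unfolding differs from the Kolda and Bader convention, which does not affect ranks, singular values, or norms.

```python
import numpy as np

def unfold(W, k):
    """Mode-k unfolding: an (n_k, N / n_k) matrix whose columns are mode-k fibers.

    Assumption: columns are ordered by NumPy's row-major layout, not by the
    Kolda-Bader convention; singular values are identical either way.
    """
    return np.moveaxis(W, k, 0).reshape(W.shape[k], -1)

def inner(W, X):
    """<W, X> = vec(W)^T vec(X), the sum of element-wise products."""
    return float(np.sum(W * X))

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 10, 10))      # a 3-way tensor with N = 400 elements
unfold_shapes = [unfold(W, k).shape for k in range(W.ndim)]
fro = np.sqrt(inner(W, W))                # Frobenius norm via the inner product
```

Each unfolding has $n_k$ rows and $N/n_k$ columns, so the three shapes here are (4, 100), (10, 40), and (10, 40).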


2 Learning with Tensor Regularization 

In this section, we put forward inductive tensor learning models with tensor regularization 
and review different tensor norms used for low-rank regularization. 

2.1 Problem Formulation 

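Our focus in this paper is regression and classification of tensor data. Let us consider
a data set $(\mathcal{X}_i, y_i)$, $i = 1, \ldots, m$, where
$\mathcal{X}_i \in \mathbb{R}^{n_1 \times \cdots \times n_K}$ is a covariate tensor and $y_i$ is
a target: $y_i \in \mathbb{R}$ for regression, while $y_i \in \{-1, 1\}$ for classification. We
consider the following learning model for a tensor norm $\|\cdot\|_\star$:

$$\min_{\mathcal{W}, b} \sum_{i=1}^{m} l(\mathcal{X}_i, y_i, \mathcal{W}, b) + \lambda \|\mathcal{W}\|_\star, \qquad (1)$$

where $l(\mathcal{X}_i, y_i, \mathcal{W}, b)$ is the loss function: the squared loss,

$$l(\mathcal{X}_i, y_i, \mathcal{W}, b) = \frac{1}{2} \big( y_i - (\langle \mathcal{W}, \mathcal{X}_i \rangle + b) \big)^2, \qquad (2)$$

is used for regression, and the logistic loss,

$$l(\mathcal{X}_i, y_i, \mathcal{W}, b) = \log\big(1 + \exp(-y_i (\langle \mathcal{W}, \mathcal{X}_i \rangle + b))\big), \qquad (3)$$

is used for classification. $b \in \mathbb{R}$ is the bias term and $\lambda > 0$ is the
regularization parameter. If $\|\cdot\|_\star = \|\cdot\|_2$ or $\|\cdot\|_1$, then the above
problem is equivalent to ordinary vector-based $\ell_2$- or $\ell_1$-regularization.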
To understand the effect of tensor-based regularization, it is important to investigate
the low-rankness of tensors. When considering a matrix $W \in \mathbb{R}^{n_1 \times n_2}$, its
trace norm is defined as

$$\|W\|_{\mathrm{tr}} = \sum_{j=1}^{J} \sigma_j, \qquad (4)$$

where $\sigma_j$ is the $j$th singular value and $J$ is the number of non-zero singular values
($J \le \min(n_1, n_2)$). A matrix is called low rank if $J < \min(n_1, n_2)$. The matrix trace
norm (4) is a convex envelope of the matrix rank and it is commonly used in matrix low-rank
approximation (Recht et al., 2010).

As in matrices, the rank property is also available for tensors, but it is more
complicated due to its multidimensional structure. The mode-$k$ rank $r_k$ of a tensor
$\mathcal{W} \in \mathbb{R}^{n_1 \times \cdots \times n_K}$ is defined as the rank of the mode-$k$
unfolding $W_{(k)}$, and the multilinear rank of $\mathcal{W}$ is given as $(r_1, \ldots, r_K)$.
Mode $i$ of a tensor $\mathcal{W}$ is called low rank if $r_i < n_i$.


2.2 Overlapped Trace Norm 


One of the earliest definitions of a tensor norm is the tensor nuclear norm (Liu et al.,
2009) or the overlapped trace norm (Tomioka and Suzuki, 2013), which can be represented
for a tensor $\mathcal{W} \in \mathbb{R}^{n_1 \times \cdots \times n_K}$ as

$$\|\mathcal{W}\|_{\mathrm{overlap}} = \sum_{k=1}^{K} \|W_{(k)}\|_{\mathrm{tr}}. \qquad (5)$$

The overlapped trace norm can be viewed as a direct extension of the matrix trace norm
since it unfolds a tensor along each of its modes and computes the sum of trace norms of the
unfolded matrices. Regularization with the overlapped trace norm can also be seen as
an overlapped group regularization due to the fact that the same tensor is unfolded over
different modes and regularized with the trace norm.
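The overlapped trace norm is straightforward to evaluate numerically: unfold along every mode and add up the singular values. The sketch below is illustrative only, with hypothetical helper names (`unfold`, `trace_norm`, `overlapped_trace_norm`), and checks the result on a rank-one tensor, for which every unfolding has a single non-zero singular value equal to the Frobenius norm.

```python
import numpy as np

def unfold(W, k):
    """Mode-k unfolding (column order is immaterial for singular values)."""
    return np.moveaxis(W, k, 0).reshape(W.shape[k], -1)

def trace_norm(M):
    """Matrix trace norm: the sum of singular values."""
    return float(np.linalg.svd(M, compute_uv=False).sum())

def overlapped_trace_norm(W):
    """Sum over all modes of the trace norm of the mode-k unfolding."""
    return sum(trace_norm(unfold(W, k)) for k in range(W.ndim))

# Rank-one test tensor a (outer) b (outer) c with multilinear rank (1, 1, 1).
rng = np.random.default_rng(0)
a, b, c = rng.standard_normal(4), rng.standard_normal(10), rng.standard_normal(10)
W = np.einsum("i,j,k->ijk", a, b, c)

multilinear_rank = tuple(int(np.linalg.matrix_rank(unfold(W, k))) for k in range(3))
overlap = overlapped_trace_norm(W)
fro = float(np.linalg.norm(W.ravel()))
```

For this rank-one tensor the overlapped trace norm equals $K \|\mathcal{W}\|_F$ with $K = 3$, since each of the three unfoldings contributes exactly $\|\mathcal{W}\|_F$.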

One of the popular applications of the overlapped trace norm is tensor completion
(Gandy et al., 2011; Liu et al., 2009), where missing entries of a tensor are imputed.
Another application is multilinear multitask learning (Romera-Paredes et al., 2013), where
multiple vector-based linear learning tasks with a common feature space are arranged as
a tensor feature structure and the multiple tasks are solved together with constraints to
minimize the multilinear ranks of the tensor feature.

Theoretical analyses of the overlapped norm have been carried out for both
tensor completion (Tomioka and Suzuki, 2013) and multilinear multitask learning
(Wimalawarne et al., 2014); they have shown that the prediction error of overlapped
trace norm regularization is bounded by the average mode-$k$ rank, which can be large if
some modes are close to full rank even if there are low-rank modes. Thus, these studies
imply that the overlapped trace norm performs well when the multilinear ranks have
small variations, and it may result in poor performance when the multilinear ranks
have high variations.

To overcome the weakness of the overlapped trace norm, recent research in tensor
norms has led to new norms such as the latent trace norm (Tomioka and Suzuki, 2013)
and the scaled latent trace norm (Wimalawarne et al., 2014).


2.3 Latent Trace Norm


Tomioka and Suzuki (2013) proposed the latent trace norm as

$$\|\mathcal{W}\|_{\mathrm{latent}} = \inf_{\mathcal{W}^{(1)} + \mathcal{W}^{(2)} + \cdots + \mathcal{W}^{(K)} = \mathcal{W}} \sum_{k=1}^{K} \|W^{(k)}_{(k)}\|_{\mathrm{tr}}.$$


The latent trace norm decomposes the tensor into a mixture of $K$ latent tensors, one per
mode, and regularizes each of them separately. In contrast to the overlapped trace
norm, the latent trace norm regularizes a different latent tensor for each unfolded
mode, and this gives the tendency that the latent trace norm picks the latent tensor
with the lowest rank.

In general, the latent trace norm results in a mixture of latent tensors and the content
of each latent tensor would depend on the rank of its unfolding. In an extreme case, for a
tensor with all its modes full rank except one, regularization with the latent trace
norm would make the latent tensor associated with the low-rank mode prominent
while the others become zero.
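The extreme case can be made concrete with a short worked bound. It uses only the definition of the latent trace norm and the standard fact that a matrix of rank $r$ satisfies $\|M\|_{\mathrm{tr}} \le \sqrt{r}\, \|M\|_F$. Placing the whole tensor in a single latent component, $\mathcal{W}^{(k)} = \mathcal{W}$ and $\mathcal{W}^{(j)} = 0$ for $j \ne k$, is one feasible decomposition, and the infimum cannot exceed the value attained by any feasible choice:

$$\|\mathcal{W}\|_{\mathrm{latent}} \le \min_{k} \|W_{(k)}\|_{\mathrm{tr}} \le \min_{k} \sqrt{r_k}\, \|\mathcal{W}\|_F = \sqrt{\min_{k} r_k}\; \|\mathcal{W}\|_F.$$

The norm is therefore controlled by the single lowest-rank mode, regardless of how close to full rank the other modes are.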


2.4 Scaled Latent Trace Norm 


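Recently, Wimalawarne et al. (2014) proposed the scaled latent trace norm as an extension
of the latent trace norm:

$$\|\mathcal{W}\|_{\mathrm{scaled}} = \inf_{\mathcal{W}^{(1)} + \mathcal{W}^{(2)} + \cdots + \mathcal{W}^{(K)} = \mathcal{W}} \sum_{k=1}^{K} \frac{1}{\sqrt{n_k}} \|W^{(k)}_{(k)}\|_{\mathrm{tr}}.$$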


Compared to the latent trace norm, the scaled latent trace norm takes the rank relative 
to the mode dimension. A major drawback of the latent trace norm is its inability to 
identify the rank of a mode relative to its dimension. If a tensor has a mode where 
its dimension is smaller than other modes yet its relative rank with respect to its mode 
dimension is high compared to other modes, the latent trace norm could incorrectly pick 
the smallest mode. 

The scaled latent norm has the ability to overcome this problem by its scaling with the
mode dimensions such that it is able to work with the relative ranks of the tensor. In the
context of multilinear multitask learning, it has been shown that the scaled latent trace
norm works well for tensors with high variations in multilinear ranks and mode dimensions
compared to the overlapped trace norm and the latent trace norm (Wimalawarne et al.,
2014).

The inductive learning setting mentioned in (1) with the overlapped trace norm has
been studied previously in Signoretto et al. (2013). However, theoretical analysis and
performance comparison with other tensor norms have not been conducted yet. Similarly
to tensor decomposition (Tomioka and Suzuki, 2013) and multilinear multitask learning
(Wimalawarne et al., 2014), tensor-based regression and classification may also be improved
by regularization methods that can work with high variations in multilinear ranks
and mode dimensions.

In the following sections, to make tensor-based learning more practical and to improve
the performance, we consider formulation (1) with the overlapped trace norm, the
latent trace norm, and the scaled latent trace norm, and give computationally efficient
optimization algorithms and excess risk bounds.


3 Optimization 


In this section, we consider the dual formulation for (1) and propose computationally
efficient optimization algorithms. Since optimization of (1) with regularization using
the overlapped trace norm has already been studied in Signoretto et al. (2013), we do
not discuss it again here. Our main focus in this section is optimization of (1) with
regularization using the latent trace norm and the scaled latent trace norm.


Let us consider the formulation (1) for a data set
$(\mathcal{X}_i, y_i) \in \mathbb{R}^{n_1 \times \cdots \times n_K} \times \mathbb{R}$,
$i = 1, \ldots, m$, with latent and scaled latent trace norm regularization as follows:

$$P(\mathcal{W}, b) = \min_{\mathcal{W}^{(1)} + \cdots + \mathcal{W}^{(K)} = \mathcal{W},\, b} \; \sum_{i=1}^{m} l(\mathcal{X}_i, y_i, \mathcal{W}, b) + \sum_{k=1}^{K} \lambda_k \|W^{(k)}_{(k)}\|_{\mathrm{tr}}, \qquad (6)$$

where, for $k = 1, \ldots, K$ and for any given regularization parameter $\lambda$,
$\lambda_k = \lambda$ in the case of the latent trace norm and
$\lambda_k = \lambda / \sqrt{n_k}$ in the case of the scaled latent trace norm,
respectively. $W^{(k)}_{(k)}$ is the unfolding of $\mathcal{W}^{(k)}$ on its $k$th mode. It is
worth noticing that the application of the latent and scaled latent trace norms requires
optimizing over $K$ latent tensors which contain $KN$ variables in total. For large $K$ and
$N$, solving the primal problem (6) can be computationally expensive especially in non-linear
problems such as logistic regression, since they require computationally expensive
optimization methods such as gradient descent or the Newton method. If the number of
training samples satisfies $m \ll KN$, solving the dual problem of (6) could be
computationally more efficient. For this reason, we focus on optimization in the dual below.

The dual formulation of (6) can be written as follows (its detailed derivation is given
in Appendix A):

$$\min_{\alpha, \mathcal{V}^{(1)}, \ldots, \mathcal{V}^{(K)}} \; D(-\alpha) + \sum_{k=1}^{K} \delta_{\lambda_k}\big(V^{(k)}_{(k)}\big)$$

$$\text{subject to} \quad \mathcal{V}^{(k)} = \sum_{i=1}^{m} \alpha_i \mathcal{X}_i \;\; (k = 1, \ldots, K), \qquad \sum_{i=1}^{m} \alpha_i = 0, \qquad (7)$$

where $\alpha = (\alpha_1, \ldots, \alpha_m)^\top \in \mathbb{R}^m$ are dual variables
corresponding to the training data set $(\mathcal{X}_i, y_i)$, $i = 1, \ldots, m$, and
$D(-\alpha)$ is the conjugate loss function defined as

$$D(-\alpha) = \sum_{i=1}^{m} \Big( \frac{1}{2} \alpha_i^2 - \alpha_i y_i \Big)$$

in the case of regression with the squared loss (Tomioka et al., 2011c), and

$$D(-\alpha) = \sum_{i=1}^{m} \Big( y_i \alpha_i \log(y_i \alpha_i) + (1 - y_i \alpha_i) \log(1 - y_i \alpha_i) \Big)$$

with constraint $0 \le y_i \alpha_i \le 1$ in the case of classification with the logistic loss
(Tomioka et al., 2011c). $\delta_{\lambda_k}$ is the indicator function defined as
$\delta_{\lambda_k}(V) = 0$ if $\|V\|_{\mathrm{op}} \le \lambda_k$ and
$\delta_{\lambda_k}(V) = \infty$ otherwise. The constraint $\sum_{i=1}^{m} \alpha_i = 0$ is due
to the bias term $b$. Here, the auxiliary variables
$\mathcal{V}^{(1)}, \ldots, \mathcal{V}^{(K)}$ are introduced to remove the coupling between the
indicator functions in the objective function (see Appendix A for details).


The alternating direction method of multipliers (ADMM) (Gabay and Mercier, 1976;
Boyd et al., 2011) has been previously used to solve primal problems of tensor decomposition
(Tomioka et al., 2011b) and multilinear multi-task learning (Romera-Paredes et al.,
2013) with the overlapped trace norm regularization. Optimization in the dual for tensor
decomposition problems with the latent and scaled latent trace norm regularization has
been solved using ADMM in Tomioka et al. (2011b). Here, we also adopt ADMM to
solve (7), and describe the formulation and the optimization steps in detail.

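With introduction of dual variables
$\mathcal{W}^{(k)} \in \mathbb{R}^{n_1 \times \cdots \times n_K}$, $k = 1, \ldots, K$
(corresponding to the primal variables of (6)), $b \in \mathbb{R}$, and parameter $\beta > 0$,
the augmented Lagrangian function for (7) is defined as follows:

$$L\big(\alpha, \{\mathcal{V}^{(k)}\}_{k=1}^{K}, \{\mathcal{W}^{(k)}\}_{k=1}^{K}, b\big) = D(-\alpha) + \sum_{k=1}^{K} \bigg( \delta_{\lambda_k}\big(V^{(k)}_{(k)}\big) + \Big\langle \mathcal{W}^{(k)}, \sum_{i=1}^{m} \alpha_i \mathcal{X}_i - \mathcal{V}^{(k)} \Big\rangle + \frac{\beta}{2} \Big\| \sum_{i=1}^{m} \alpha_i \mathcal{X}_i - \mathcal{V}^{(k)} \Big\|_F^2 \bigg) + b \sum_{i=1}^{m} \alpha_i + \frac{\beta}{2} \Big( \sum_{i=1}^{m} \alpha_i \Big)^2.$$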


This ADMM formulation is solved for variables $\alpha$,
$\mathcal{V}^{(1)}, \ldots, \mathcal{V}^{(K)}$, $\mathcal{W}^{(1)}, \ldots, \mathcal{W}^{(K)}$,
and $b$ by considering sub-problems for each variable. Below, we give the solution for each
variable at iterative step $t + 1$.

The first sub-problem to solve is for $\alpha$ at step $t + 1$:

$$\alpha^{t+1} = \arg\min_{\alpha} L\big(\alpha, \{\mathcal{V}^{(k)t}\}_{k=1}^{K}, \{\mathcal{W}^{(k)t}\}_{k=1}^{K}, b^t\big),$$

where $\{\mathcal{V}^{(k)t}\}_{k=1}^{K}$, $\{\mathcal{W}^{(k)t}\}_{k=1}^{K}$, and $b^t$ are the
solutions obtained at step $t$.

Depending on the conjugate loss $D(-\alpha)$, the solution for $\alpha$ differs. In the case of
regression with the squared loss (2), the augmented Lagrangian can be minimized with
respect to $\alpha$ by solving the following linear equation:

$$\big( K \beta X X^\top + I + \beta 1_m 1_m^\top \big) \alpha^{t+1} = y - X \mathrm{vec}(\mathcal{W}^t) + \beta X \mathrm{vec}(\mathcal{V}^t) - 1_m b^t,$$

where $X = [\mathrm{vec}(\mathcal{X}_1)^\top; \cdots; \mathrm{vec}(\mathcal{X}_m)^\top] \in \mathbb{R}^{m \times N}$,
$\mathcal{V}^t = \sum_{k=1}^{K} \mathcal{V}^{(k)t}$, $\mathcal{W}^t = \sum_{k=1}^{K} \mathcal{W}^{(k)t}$,
$y = (y_1, \ldots, y_m)^\top$, and $1_m$ is the $m$-dimensional vector of all ones. Note that, in
the above system of equations, the coefficient matrix multiplied with $\alpha$ does not change
during optimization. Thus, the system can be efficiently solved at each iteration by
precomputing the Cholesky factorization of the matrix.
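The precomputation can be sketched as follows for the squared-loss case. This is an illustration under stated assumptions rather than the authors' MATLAB implementation: the data are random placeholders, the names `alpha_update`, `W_sum`, and `V_sum` are hypothetical, and `np.linalg.solve` is used for the two triangular systems for brevity where a dedicated triangular solver would normally be preferred.

```python
import numpy as np

rng = np.random.default_rng(0)
m, shape, K, beta = 20, (4, 10, 10), 3, 1.0   # assumed sizes and ADMM parameter
Xs = rng.standard_normal((m,) + shape)        # placeholder covariate tensors
X = Xs.reshape(m, -1)                         # rows are vec(X_i)^T, shape (m, N)
y = rng.standard_normal(m)                    # placeholder targets
ones = np.ones(m)

# Coefficient matrix K*beta*X X^T + I + beta*1 1^T: constant over iterations.
A = K * beta * (X @ X.T) + np.eye(m) + beta * np.outer(ones, ones)
L = np.linalg.cholesky(A)                     # A = L L^T, factorized once

def alpha_update(W_sum, V_sum, b):
    """Solve the alpha sub-problem given sum_k W^(k)t, sum_k V^(k)t, and b^t."""
    rhs = y - X @ W_sum.ravel() + beta * (X @ V_sum.ravel()) - b * ones
    z = np.linalg.solve(L, rhs)               # forward substitution step
    return np.linalg.solve(L.T, z)            # backward substitution step

alpha0 = alpha_update(np.zeros(shape), np.zeros(shape), 0.0)
```

Because the matrix is only $m \times m$, the per-iteration cost of this step depends on the number of samples rather than on the $KN$ primal variables, which is the reason the dual is attractive when $m$ is moderate.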

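For classification with the logistic loss (3), the Newton method is used to
find the solution for $\alpha^{t+1}$, which requires the gradient and the Hessian of
$L(\alpha, \{\mathcal{V}^{(k)}\}_{k=1}^{K}, \{\mathcal{W}^{(k)}\}_{k=1}^{K}, b)$:

$$\frac{\partial L}{\partial \alpha_i} = y_i \log \frac{y_i \alpha_i}{1 - y_i \alpha_i} + \sum_{k=1}^{K} \big\langle \mathcal{X}_i, \mathcal{W}^{(k)t} \big\rangle + \beta \sum_{k=1}^{K} \Big\langle \mathcal{X}_i, \sum_{j=1}^{m} \mathcal{X}_j \alpha_j - \mathcal{V}^{(k)t} \Big\rangle + b^t + \beta \sum_{j=1}^{m} \alpha_j,$$

$$\frac{\partial^2 L}{\partial \alpha_i \partial \alpha_j} = \begin{cases} \dfrac{1}{y_i \alpha_i (1 - y_i \alpha_i)} + K \beta \langle \mathcal{X}_i, \mathcal{X}_i \rangle + \beta & (i = j), \\[2mm] K \beta \langle \mathcal{X}_i, \mathcal{X}_j \rangle + \beta & (i \ne j). \end{cases}$$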


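Next, we update $\mathcal{V}^{(k)}$ at step $t + 1$ by solving the following sub-problem:

$$\mathcal{V}^{(k)t+1} = \arg\min_{\mathcal{V}^{(k)}} L\big(\alpha^{t+1}, \mathcal{V}^{(k)}, \{\mathcal{V}^{(j)t}\}_{j \ne k}, \{\mathcal{W}^{(k)t}\}_{k=1}^{K}, b^t\big),$$

whose solution in terms of mode-$k$ unfoldings is

$$V^{(k)t+1}_{(k)} = \mathrm{proj}_{\lambda_k} \bigg( \frac{W^{(k)t}_{(k)}}{\beta} + \sum_{i=1}^{m} \alpha_i^{t+1} X_{i(k)} \bigg), \qquad (8)$$

where $X_{i(k)}$ is the mode-$k$ unfolding of $\mathcal{X}_i$,
$\mathrm{proj}_{\lambda}(W) = U \min(S, \lambda) V^\top$, and $W = U S V^\top$ is the singular
value decomposition of $W$.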

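Finally, we update the dual variables $\mathcal{W}^{(k)}$ and $b$ at step $t + 1$ as

$$\mathcal{W}^{(k)t+1} = \mathcal{W}^{(k)t} + \beta \Big( \sum_{i=1}^{m} \alpha_i^{t+1} \mathcal{X}_i - \mathcal{V}^{(k)t+1} \Big), \qquad (9)$$

$$b^{t+1} = b^t + \beta \sum_{i=1}^{m} \alpha_i^{t+1}. \qquad (10)$$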

Note that step (8) and step (9) can be combined as

$$W^{(k)t+1}_{(k)} = \mathrm{prox}_{\beta \lambda_k} \bigg( W^{(k)t}_{(k)} + \beta \sum_{i=1}^{m} \alpha_i^{t+1} X_{i(k)} \bigg),$$

where $\mathrm{prox}_{\lambda}(W) = U \max(S - \lambda, 0) V^\top$ and $W = U S V^\top$. This
allows us to avoid computing singular values and the associated singular vectors that are
smaller than the threshold $\lambda_k$ in (8).
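The equivalence between the two-step update (spectral projection followed by the multiplier update) and the single soft-thresholding step can be checked numerically. The sketch below is illustrative: `proj` clips singular values at a threshold, `prox` soft-thresholds them, and the matrices `Wk` and `A` are random stand-ins for the unfolded multiplier $W^{(k)t}_{(k)}$ and the unfolded sum $\sum_i \alpha_i^{t+1} X_{i(k)}$; none of these names come from the paper.

```python
import numpy as np

def proj(Z, lam):
    """Project onto the spectral-norm ball: clip singular values at lam."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * np.minimum(s, lam)) @ Vt

def prox(Z, lam):
    """Singular value soft-thresholding: shrink singular values by lam."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

rng = np.random.default_rng(0)
beta, lam = 0.5, 2.0                       # assumed ADMM parameter and lambda_k
Wk = rng.standard_normal((10, 40))         # stand-in for the unfolded multiplier
A = rng.standard_normal((10, 40))          # stand-in for the unfolded data sum

V_new = proj(Wk / beta + A, lam)           # projection step for V
W_two_step = Wk + beta * (A - V_new)       # multiplier update using V_new
W_combined = prox(Wk + beta * A, beta * lam)   # single combined step
```

The identity follows from $Z - \mathrm{proj}_{\lambda}(Z) = \mathrm{prox}_{\lambda}(Z)$ together with $\beta\, \mathrm{prox}_{\lambda}(Z) = \mathrm{prox}_{\beta\lambda}(\beta Z)$. In a full implementation only the singular triplets above the threshold would be needed, so a truncated SVD could replace the full one used here.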


Optimality Condition 


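As a stopping condition, we use the relative duality gap (Tomioka et al., 2011a), which
can be expressed as

$$\frac{P(\mathcal{W}^t, b^t) - D(-\tilde{\alpha}^t)}{P(\mathcal{W}^t, b^t)} < \epsilon,$$

where $P(\mathcal{W}^t, b^t)$ is the primal solution at step $t$ of (6) and $\epsilon$ is a
predefined tolerance value. $D(-\tilde{\alpha}^t)$ is the dual solution at step $t$ of (7)
with $\tilde{\alpha}^t$ obtained by multiplying $\alpha^t$ with

$$\min \bigg( 1, \frac{\lambda_1}{\|V(\alpha^t)_{(1)}\|_{\mathrm{op}}}, \ldots, \frac{\lambda_K}{\|V(\alpha^t)_{(K)}\|_{\mathrm{op}}} \bigg),$$

where $\mathcal{V}(\alpha) = \sum_{i=1}^{m} \alpha_i \mathcal{X}_i$ and $\|V\|_{\mathrm{op}}$ is the
largest singular value of $V$.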












4 Theoretical Risk Analysis 


In this section, we theoretically analyze the excess risk for regularization with the
overlapped trace norm, the latent trace norm, and the scaled latent trace norm.

We consider a loss function $l$ which is Lipschitz continuous with constant $\Lambda$. Note that
this condition is true for both the squared loss and logistic loss functions. Let the training
data set be given as
$(\mathcal{X}_i, y_i) \in \mathbb{R}^{n_1 \times \cdots \times n_K} \times Y$, $i = 1, \ldots, m$,
where $Y = \mathbb{R}$ for regression and $Y = \{-1, 1\}$ for classification. In our theoretical
analysis, we assume that elements of $\mathcal{X}_i$ independently follow the standard Gaussian
distribution.

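Following the standard formulation (Maurer and Pontil, 2013), the empirical risk without
the bias term is defined as

$$\hat{R}(\mathcal{W}) = \frac{1}{m} \sum_{i=1}^{m} l(\langle \mathcal{W}, \mathcal{X}_i \rangle, y_i),$$

and the expected risk is defined as

$$R(\mathcal{W}) = \mathbb{E}_{(\mathcal{X}, y) \sim \mu}\, l(\langle \mathcal{W}, \mathcal{X} \rangle, y),$$

where $\mu$ is the probability distribution from which $(\mathcal{X}_i, y_i)$ are sampled.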

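The optimal $\mathcal{W}^0$ that minimizes the expected risk is given as

$$\mathcal{W}^0 = \arg\min_{\mathcal{W}} R(\mathcal{W}) \quad \text{subject to} \quad \|\mathcal{W}\|_\star \le B_0, \qquad (11)$$

where $\|\cdot\|_\star$ is either the overlapped trace norm, the latent trace norm, or the scaled
latent trace norm. The optimal $\hat{\mathcal{W}}$ that minimizes the empirical risk is denoted as

$$\hat{\mathcal{W}} = \arg\min_{\mathcal{W}} \hat{R}(\mathcal{W}) \quad \text{subject to} \quad \|\mathcal{W}\|_\star \le B_0. \qquad (12)$$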

vv 

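The next lemma provides an upper bound of the excess risk for tensor-based learning
problems (see Appendix B for its proof), where $\|\mathcal{W}\|_{\star^*}$ is the dual norm of
$\|\mathcal{W}\|_\star$ for $\star \in \{\mathrm{overlap}, \mathrm{latent}, \mathrm{scaled}\}$:

Lemma 1. For a given $\Lambda$-Lipschitz continuous loss function $l$ and for any
$\mathcal{W} \in \mathbb{R}^{n_1 \times \cdots \times n_K}$ such that
$\|\mathcal{W}\|_\star \le B_0$ for problems (11)-(12), the excess risk for a given training data
set $(\mathcal{X}_i, y_i) \in \mathbb{R}^{n_1 \times \cdots \times n_K} \times \mathbb{R}$,
$i = 1, \ldots, m$, is bounded with probability at least $1 - \delta$ as

$$R(\hat{\mathcal{W}}) - R(\mathcal{W}^0) \le \frac{2 \Lambda B_0}{m}\, \mathbb{E} \|\mathcal{M}\|_{\star^*} + \sqrt{\frac{\log(2/\delta)}{2m}}, \qquad (13)$$

where $\mathcal{M} = \sum_{i=1}^{m} \sigma_i \mathcal{X}_i$ and $\sigma_i \in \{-1, 1\}$ are
Rademacher random variables.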


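The next theorem gives an excess risk bound for overlapped trace norm regularization
(its proof is also included in Appendix B), which is based on the inequality
$\|\mathcal{W}\|_{\mathrm{overlap}} \le \sum_{k=1}^{K} \sqrt{r_k}\, \|\mathcal{W}\|_F$ given in
Tomioka and Suzuki (2013):

Theorem 1. With probability at least $1 - \delta$, the excess risk of learning using the
overlapped trace norm regularization for any $\mathcal{W}^0$ with $\|\mathcal{W}^0\|_F \le B$,
multilinear ranks $(r_1, \ldots, r_K)$, and estimator $\hat{\mathcal{W}}$ with
$B_0 \le B \sum_{k=1}^{K} \sqrt{r_k}$ is bounded as

$$R(\hat{\mathcal{W}}) - R(\mathcal{W}^0) \le c_1 \Lambda \frac{B}{\sqrt{m}} \bigg( \sum_{k=1}^{K} \sqrt{r_k} \bigg) \min_{k} \big( \sqrt{n_k} + \sqrt{n_{\setminus k}} \big) + c_2 \sqrt{\frac{\log(2/\delta)}{2m}}, \qquad (14)$$

where $n_{\setminus k} = \prod_{j \ne k} n_j$ and $c_1$ and $c_2$ are constants.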

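In the next theorem, we give an excess risk bound for the latent trace norm (its proof
is also included in Appendix B), which uses the inequality
$\|\mathcal{W}\|_{\mathrm{latent}} \le \sqrt{\min_k r_k}\, \|\mathcal{W}\|_F$ given in
Tomioka and Suzuki (2013):

Theorem 2. With probability at least $1 - \delta$, the excess risk of learning using the latent
trace norm regularization for any $\mathcal{W}^0$ with $\|\mathcal{W}^0\|_F \le B$, multilinear
ranks $(r_1, \ldots, r_K)$, and estimator $\hat{\mathcal{W}}$ with
$B_0 \le B \sqrt{\min_k r_k}$ is bounded as

$$R(\hat{\mathcal{W}}) - R(\mathcal{W}^0) \le c_1 \Lambda B \sqrt{\frac{\min_k r_k}{m}} \bigg( \max_{k} \big( \sqrt{n_k} + \sqrt{n_{\setminus k}} \big) + C \sqrt{2 \log(K)} \bigg) + c_2 \sqrt{\frac{\log(2/\delta)}{2m}}, \qquad (15)$$

where $n_{\setminus k} = \prod_{j \ne k} n_j$ and $c_1$, $c_2$, and $C$ are constants.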


The above theorem shows that the excess risk for the latent trace norm (15) is bounded
by the minimum multilinear rank. If $n_1 = \cdots = n_K$, the latent trace norm is always better
than the overlapped trace norm in terms of the excess risk bounds because
$\sqrt{\min_k r_k} < \sum_{k=1}^{K} \sqrt{r_k}$. If the dimensions $n_1, \ldots, n_K$ are not the
same, the overlapped trace norm could be better.

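Finally, we bound the excess risk for the scaled latent trace norm based on the inequality
$\|\mathcal{W}\|_{\mathrm{scaled}} \le \sqrt{\min_k (r_k / n_k)}\, \|\mathcal{W}\|_F$ given in
Wimalawarne et al. (2014):

Theorem 3. With probability at least $1 - \delta$, the excess risk of learning using the
scaled latent trace norm regularization for any $\mathcal{W}^0$ with
$\|\mathcal{W}^0\|_F \le B$, multilinear ranks $(r_1, \ldots, r_K)$, and estimator
$\hat{\mathcal{W}}$ with $B_0 \le B \sqrt{\min_k (r_k / n_k)}$ is bounded as

$$R(\hat{\mathcal{W}}) - R(\mathcal{W}^0) \le c_1 \Lambda B \sqrt{\frac{1}{m} \min_{k} \Big( \frac{r_k}{n_k} \Big)} \bigg( \max_{k} \big( n_k + \sqrt{N} \big) + C \sqrt{2 \log(K)} \bigg) + c_2 \sqrt{\frac{\log(2/\delta)}{2m}}, \qquad (16)$$

where $c_1$, $c_2$, and $C$ are constants.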

Note that when $n_1 = \cdots = n_K = n$, the bounds in Theorem 2 and Theorem 3 are the same
even if the multilinear ranks $r_1, \ldots, r_K$ are different.

Theorem 3 shows that the excess risk for regularization with the scaled latent trace
norm is bounded with the minimum of multilinear ranks relative to their mode dimensions.
Similarly to the latent trace norm, the scaled latent trace norm would also perform
better than the overlapped norm when the multilinear ranks have large variations. If we
consider a "flat" tensor, the modes with small dimensions may have ranks comparable to
their dimensions. Although these modes have the lowest mode-$k$ rank, they do not impose
a low-rank structure. In such cases, our theory predicts that the scaled latent trace
norm performs better because it is sensitive to the mode-$k$ rank relative to its dimension.
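The rank-dependent quantities that drive the three bounds can be compared with a few lines of arithmetic. The sketch below is only an illustration: the function `rank_factors` is a hypothetical helper, it evaluates the rank terms named in the text (sum of square-rooted ranks, minimum rank, minimum rank-to-dimension ratio) and ignores the dimension-dependent factors and constants of the full bounds. The example uses a "flat" tensor with mode dimensions (4, 10, 10) and multilinear ranks (3, 4, 8).

```python
import numpy as np

def rank_factors(dims, ranks):
    """Rank-dependent terms of the three excess risk bounds (constants omitted)."""
    n = np.asarray(dims, dtype=float)
    r = np.asarray(ranks, dtype=float)
    return {
        "overlap": float(np.sum(np.sqrt(r))),       # sum_k sqrt(r_k)
        "latent": float(np.sqrt(np.min(r))),        # sqrt(min_k r_k)
        "scaled": float(np.sqrt(np.min(r / n))),    # sqrt(min_k r_k / n_k)
        "latent_mode": int(np.argmin(r)) + 1,       # mode selected by absolute rank
        "scaled_mode": int(np.argmin(r / n)) + 1,   # mode selected by relative rank
    }

flat = rank_factors(dims=(4, 10, 10), ranks=(3, 4, 8))
```

Here the absolute rank is smallest on mode 1 (rank 3 out of dimension 4, which is nearly full rank), whereas the relative rank is smallest on mode 2 (4/10), so the two latent norms single out different modes.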

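As a variation, we can also consider a mode-wise "scaled" version of the overlapped
trace norm defined as
$\|\mathcal{W}\|_{\mathrm{soverlap}} := \sum_{k=1}^{K} \frac{1}{\sqrt{n_k}} \|W_{(k)}\|_{\mathrm{tr}}$.
It can be easily seen that
$\|\mathcal{W}\|_{\mathrm{soverlap}} \le \sum_{k=1}^{K} \sqrt{r_k / n_k}\, \|\mathcal{W}\|_F$ holds
and, with the same conditions as in Theorem 1, we can upper-bound the excess risk for the
scaled overlapped trace norm regularization as

$$R(\hat{\mathcal{W}}) - R(\mathcal{W}^0) \le c_1 \Lambda \frac{B}{\sqrt{m}} \bigg( \sum_{k=1}^{K} \sqrt{\frac{r_k}{n_k}} \bigg) \min_{k} \big( n_k + \sqrt{N} \big) + c_2 \sqrt{\frac{\log(2/\delta)}{2m}}. \qquad (17)$$

Note that when all modes have the same dimensions, (17) coincides with (14). Compared
with bound (16), the scaled latent norm would perform better than the scaled overlapped
norm regularization since
$\sqrt{\min_k (r_k / n_k)} \le \sum_{k=1}^{K} \sqrt{r_k / n_k}$.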


5 Experiments 

We conducted several experiments using simulated and real-world data to evaluate the 
performance of tensor-based regression and classification methods with regularizations 
using different tensor norms. We discuss simulations for tensor-based regression in Section
5.1 and experiments with real-world data for tensor classification in Section 5.2. For all
experiments, we use a MATLAB® environment on a 2.10 GHz (2x8 cores) Intel Xeon 
E5-2450 server machine with 128 GB memory. 


5.1 Tensor Regression with Artificial Data 


We report the results of artificial data experiments on tensor-based regression. 

We generated three different 3-mode tensors as weight tensors $\mathcal{W}$ with different
multilinear ranks and mode dimensions. We created two homogeneous tensors with equal mode
dimensions of $n_1 = n_2 = n_3 = 10$ with different multilinear ranks
$(r_1, r_2, r_3) = (3, 3, 3)$ and $(r_1, r_2, r_3) = (3, 5, 8)$. The third weight tensor is an
inhomogeneous case with mode dimensions of $n_1 = 4$, $n_2 = n_3 = 10$ and multilinear ranks
$(r_1, r_2, r_3) = (3, 4, 8)$. To generate these weight tensors, we use the Tucker
decomposition (Kolda and Bader, 2009) of a tensor as
$\mathcal{W} = \mathcal{C} \times_{k=1}^{3} U^{(k)}$, where
$\mathcal{C} \in \mathbb{R}^{r_1 \times r_2 \times r_3}$ is the core tensor and
$U^{(k)} \in \mathbb{R}^{r_k \times n_k}$ are component matrices. We sample elements of the core
tensor $\mathcal{C}$ from a standard Gaussian distribution, choose component matrices
$U^{(k)} \in \mathbb{R}^{r_k \times n_k}$ to be orthogonal matrices, and
generate $\mathcal{W}$ by mode-wise multiplication of the core tensor and component matrices.
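A weight tensor with a prescribed multilinear rank can be generated along these lines. The following is a sketch under assumptions, not the original MATLAB code: `random_tucker_tensor` is a hypothetical helper, the orthonormal factors are drawn via QR decomposition of Gaussian matrices (the paper does not say how its orthogonal matrices were chosen), and each factor is stored as an $n_k \times r_k$ matrix, i.e., the transpose of the $r_k \times n_k$ shape used in the text.

```python
import numpy as np

def unfold(W, k):
    """Mode-k unfolding (column order is immaterial for rank computations)."""
    return np.moveaxis(W, k, 0).reshape(W.shape[k], -1)

def random_tucker_tensor(dims, ranks, rng):
    """Gaussian core multiplied along each mode by a factor with orthonormal columns."""
    W = rng.standard_normal(ranks)                   # core tensor C
    for k, (n_k, r_k) in enumerate(zip(dims, ranks)):
        Q, _ = np.linalg.qr(rng.standard_normal((n_k, r_k)))   # n_k x r_k factor
        # Mode-k product: contract axis k of W with the r_k axis of Q,
        # then move the new n_k axis back to position k.
        W = np.moveaxis(np.tensordot(Q, W, axes=(1, k)), 0, k)
    return W

rng = np.random.default_rng(0)
W = random_tucker_tensor(dims=(4, 10, 10), ranks=(3, 4, 8), rng=rng)
observed_ranks = tuple(int(np.linalg.matrix_rank(unfold(W, k))) for k in range(W.ndim))
```

With a Gaussian core the unfoldings have the requested ranks with probability one, so `observed_ranks` should reproduce the multilinear rank passed in.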

To create training samples $\{\mathcal{X}_i, y_i\}_{i=1}^{m}$, we first create random tensors
$\mathcal{X}_i$ with each element independently sampled from the standard Gaussian distribution
and obtain $y_i = \langle \mathcal{W}, \mathcal{X}_i \rangle + \nu_i$, where $\nu_i$ is noise
drawn from the Gaussian distribution with mean zero and variance 0.1. In our experiments we use
cross validation to select the regularization parameter from range 0.01-100 at intervals of
0.1. For the purpose of comparison, we have also simulated matrix regularized regressions for
each mode unfolding. Also, we experimented with cross validation among matrix regularization
on each unfolded matrix to understand whether it can find the correct mode for regularization.
As the baseline vector-based learning method, we use ridge regression (i.e.,
$\ell_2$-regularized least-squares).
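The data generation and the vector-based baseline can be sketched as follows. This is a simplified illustration with assumed details: the weight tensor is a dense Gaussian stand-in rather than a low-rank Tucker tensor, the sample size `m` and regularization value are arbitrary, and `ridge_fit` is a hypothetical closed-form solver without a bias term.

```python
import numpy as np

rng = np.random.default_rng(0)
dims, m = (4, 10, 10), 200                       # assumed tensor size and sample count
W_true = rng.standard_normal(dims)               # stand-in weight tensor
Xs = rng.standard_normal((m,) + dims)            # covariates with standard Gaussian entries
noise = rng.normal(0.0, np.sqrt(0.1), size=m)    # zero mean, variance 0.1
y = np.tensordot(Xs, W_true, axes=W_true.ndim) + noise   # y_i = <W, X_i> + nu_i

def ridge_fit(Xs, y, lam):
    """Ridge regression on vectorized tensors: solve (X^T X + lam I) w = X^T y."""
    X = Xs.reshape(len(y), -1)
    N = X.shape[1]
    w = np.linalg.solve(X.T @ X + lam * np.eye(N), X.T @ y)
    return w.reshape(Xs.shape[1:])

lam = 1.0
W_hat = ridge_fit(Xs, y, lam)
```

Ridge regression treats all $N$ entries of the weight tensor symmetrically, so it cannot exploit low multilinear rank; this is the structural information that the tensor norms are designed to use.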

Figure 1 shows the performance for homogeneous tensors with equal mode dimensions
$n_1 = n_2 = n_3 = 10$ and equal multilinear ranks $(r_1, r_2, r_3) = (3, 3, 3)$. We see that the
overlapped norm performs the best, while both latent norms perform equally (since mode
dimensions are equal) but inferior to the overlapped norm. Also, the regression results
from all matrix regularizations with individual modes are better than those of the latent and
the scaled latent norm regularized regression models. Due to the equal multilinear ranks
and equal mode dimensions, cross validation among the mode-wise unfolded matrix
regularizations performs the same as each individual mode-wise regularization.

Figure 2 shows the performance for homogeneous tensors with equal mode dimensions
$n_1 = n_2 = n_3 = 10$ and unequal multilinear ranks $(r_1, r_2, r_3) = (3, 5, 8)$. In this case,
the latent and the scaled latent norms again perform equally since the tensor dimensions
are the same. The mode-1 regularized regression model gives the best performance since
mode 1 has the lowest rank, and regularization with the latent and scaled latent norms gives
the next best performance. The mode-wise cross validation correctly coincides with the
mode-1 regularization. The overlapped norm performs poorly compared to the latent and
the scaled latent trace norms.

Figure 3 shows the performance for inhomogeneous tensors with mode dimensions
$n_1 = 4$, $n_2 = n_3 = 10$ and multilinear ranks $(r_1, r_2, r_3) = (3, 4, 8)$. In this case, we
can see that the scaled latent trace norm outperforms all other tensor norms. The latent trace
norm performs poorly since it fails to find the mode with the lowest relative rank. This agrees
well with our theoretical analysis: as shown in (15), the excess risk of the latent trace
norm is bounded with the minimum of multilinear ranks, which is on the first mode in
the current setup, and this mode is close to full rank. The scaled latent trace norm is able
to find the mode with the lowest relative rank since it takes the rank relative to the mode
dimension as in (16). If we look at the individual mode regularizations, we see that the
best performance is given with the second mode, which has the lowest rank with respect
to the mode dimension, and the worst performance is given with the first mode, which is
high ranked compared to other modes. Here, the mode-wise cross validation is again as
good as mode-2 regularization.

It is also worth noticing in all above experiments that ridge regression performed 
worse than all the tensor regularized learning models. This highlights the necessity of 
employing low-rank inducing norms for learning with tensor data without vectorization 
to get the best performance. 

Figure 4 shows the computation time for the toy regression experiment with inhomogeneous
tensors with mode dimensions $n_1 = 4$, $n_2 = n_3 = 10$ and multilinear ranks
$(r_1, r_2, r_3) = (3, 4, 8)$ (computation time for other setups showed a similar tendency and
thus we omit the results). For each data set, we measured the computation time of
training regression models, cross validation for model selection, and predicting output
values for test data. We can see that methods based on tensor norms and matrix norms
are computationally much more expensive compared to ridge regression. However, as we
saw above, they achieve higher accuracy than ridge regression. It is worth noticing that
mode-wise cross validation is computationally more expensive compared to the scaled
latent trace norm and other tensor norms. This computational advantage and comparable
performance with respect to the best mode-wise regularization make the scaled
latent trace norm a useful regularization method for tensor-based regression, especially
for tensors with high variations in their multilinear ranks.

Figure 1: Simulation results of tensor regression based on homogeneous weight tensor of
equal mode dimensions $n_1 = n_2 = n_3 = 10$ and equal multilinear ranks
$(r_1, r_2, r_3) = (3, 3, 3)$.

Figure 2: Simulation results of tensor regression based on homogeneous weight tensor of
equal mode dimensions $n_1 = n_2 = n_3 = 10$ and unequal multilinear ranks
$(r_1, r_2, r_3) = (3, 5, 8)$.

Figure 3: Simulation results of tensor regression based on inhomogeneous weight tensor of
mode dimensions $n_1 = 4$, $n_2 = n_3 = 10$ and multilinear ranks $(r_1, r_2, r_3) = (3, 4, 8)$.

Figure 5: Samples of hand motion sequences of left/flat and left/spread.


5.2 Tensor Classification for Hand Gesture Recognition 


Next, we report the results of experiments on tensor classification with the Cambridge
hand gesture data set (Kim et al., 2007).

The Cambridge hand gesture data set contains image sequences from 9 gesture classes.
These gesture classes combine 3 primitive hand shapes (flat, spread, and V-shape) with
3 different hand motions (rightward, leftward, and contract). Each class has 100 image
sequences with different illumination conditions and arbitrary motions of two people.
Previously, tensor canonical correlation analysis (Kim et al., 2007) has been used to
classify these hand gestures.

To apply tensor classification, we first build action sequences as tensor data by sampling
$S$ images at equal time intervals from each sequence. This makes each sequence a
tensor of size $20 \times 20 \times S$, where the first two modes are down-sampled images as
in Kim et al. (2007) and $S$ is the number of sampled images. In our experiments, we set $S$
to 5 or 10. We consider binary classification and have chosen the visually similar sequences
of left/flat and left/spread (Figure 5), which we found to be difficult to classify. We
standardize the data by mean removal and variance normalization. We randomly split the
data into a training set of 120 elements, a validation set of 40 elements used to
select the optimal regularization parameter, and a test set of 40 elements used to
evaluate the learned classifier. In addition to the tensor regularized learning models, we
also trained classifiers with matrix regularization with unfolding on each mode separately.
As a baseline vector-based learning method, we used $\ell_2$-regularized logistic
regression. We also performed mode-wise cross validation with individual mode regularization
(Mode-wise CV). We repeated the learning procedure on 10 sample sets for each classifier,
and the results are shown in Table 1.
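The data preparation described above can be sketched as follows. This is a minimal
illustration under stated assumptions, not the authors' pipeline: the raw sequences are
replaced by random placeholder arrays with a hypothetical length of 30 frames, the helper
names `sample_frames` and `standardize` are invented here, and computing the
standardization statistics on the training set only is an assumption.

```python
import numpy as np

def sample_frames(sequence, num_frames):
    """Pick `num_frames` images at (approximately) equal time intervals."""
    idx = np.linspace(0, sequence.shape[-1] - 1, num_frames).round().astype(int)
    return sequence[..., idx]

def standardize(train, *others):
    """Mean removal and variance normalization using training-set statistics."""
    mean = train.mean(axis=0)
    std = train.std(axis=0) + 1e-12  # guard against constant pixels
    return tuple((x - mean) / std for x in (train, *others))

rng = np.random.default_rng(0)
S = 5
# Placeholder data: 200 sequences (2 classes x 100), each 20 x 20 with 30 frames.
sequences = rng.standard_normal((200, 20, 20, 30))
labels = np.repeat([-1, 1], 100)

X = np.stack([sample_frames(seq, S) for seq in sequences])  # shape (200, 20, 20, S)

# Random split into 120 training, 40 validation, and 40 test elements.
perm = rng.permutation(len(X))
train_idx, val_idx, test_idx = perm[:120], perm[120:160], perm[160:]
X_train, X_val, X_test = standardize(X[train_idx], X[val_idx], X[test_idx])
y_train, y_val, y_test = labels[train_idx], labels[val_idx], labels[test_idx]
print(X_train.shape, X_val.shape, X_test.shape)
```

Repeating the split with different seeds would give the 10 sample sets over which the
reported errors are averaged.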

In both experiments for $S = 5$ and $S = 10$, we see that tensor norm regularized
classification performs better than the vectorized learning method. With the tensor
structure of (20, 20, 5), we can see that the scaled latent trace norm gives the best
performance and that the latent trace norm, mode-1, mode-3, and mode-wise cross validation
are comparable. We observed that, with the tensor structure of (20, 20, 5), the third mode
of the weight tensor obtained after learning becomes full rank. The scaled latent trace norm
performed the best since it could identify the mode with the minimum rank relative to its
mode dimension, which was the first mode in the current setup. The overlapped trace norm
performs poorly due to large variations in the multilinear ranks and tensor dimensions.

Table 1: Classification error of experiments with the hand gesture data set. The boldfaced
figures indicate comparable accuracies among classifiers after a t-test with significance of
0.05.

| Norm | Tensor dimensions (20,20,5) | Tensor dimensions (20,20,10) |
|---|---|---|
| Overlapped Trace Norm | 0.1425 (0.0512) | **0.0722 (0.0363)** |
| Latent Trace Norm | **0.1175 (0.0487)** | **0.0806 (0.0512)** |
| Scaled Latent Trace Norm | **0.0975 (0.0478)** | **0.0944 (0.0512)** |
| Mode-1 | **0.1050 (0.0422)** | **0.0950 (0.0438)** |
| Mode-2 | 0.1400 (0.0709) | **0.0900 (0.0459)** |
| Mode-3 | **0.1200 (0.0405)** | 0.1100 (0.0592) |
| Mode-wise CV | **0.1050 (0.0542)** | **0.0950 (0.044)** |
| Logistic regression ($\ell_2$) | 0.1975 (0.0640) | 0.1925 (0.0782) |

With the tensor structure (20, 20, 10), the overlapped trace norm gives the best
performance. In this case, we found that the multilinear ranks are close to each other,
which allowed the overlapped trace norm to perform better. The scaled latent trace norm,
latent trace norm, mode-1, mode-2, and mode-wise cross validation gave performance
comparable with the overlapped trace norm.


5.3 Tensor Classification for Brain Computer Interface 


As our second tensor classification task, we experimented with a motor-imagery EEG
classification problem in the context of brain computer interface (BCI). The objective of
the experiments was to classify movements imagined by a person using the EEG signals
captured in that instance. For our experiments, we used the data from the BCI competition
IVa (Dornhege et al., 2004). Previous research by Tomioka and Aihara (2007) has
considered a "channel $\times$ channel" matrix of the EEG signal and classified it using
logistic regression with low-rank matrix regularization. Our objective is to model EEG data
as tensors to incorporate more information and learn to classify using tensor regularization
methods. The BCI competition IVa data set consists of BCI experiments of five people.
Though BCI experiments have used 256 channels, we only use signals from 49 channels
following Tomioka and Aihara (2007) and pre-process each signal from each channel with
$Z$ different band-pass filters (Butterworth filters). Let $S_i \in \mathbb{R}^{C \times T}$,
where $C$ denotes the number of channels and $T$ denotes the time, be the matrix obtained by
processing with the $i$th filter. As in Tomioka and Aihara (2007), each $S_i$ is further
processed by centering and scaling as
$\tilde{S}_i = \frac{1}{\sqrt{T-1}} S_i \left(I_T - \frac{1}{T} \mathbf{1}\mathbf{1}^\top\right)$.
Then we obtain $X_i = \tilde{S}_i \tilde{S}_i^\top$, which is a
"channel $\times$ channel" matrix (in our setting, it is $49 \times 49$). We arrange all
$X_i$, $i = 1, \ldots, Z$, to form a tensor of dimensions $Z \times 49 \times 49$.
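The construction of the input tensor can be sketched in NumPy as below. The sketch assumes
that the centering-and-scaling step is
$\tilde{S}_i = S_i (I_T - \frac{1}{T}\mathbf{1}\mathbf{1}^\top)/\sqrt{T-1}$, so that each
slice is a sample covariance matrix; the band-pass filtering itself is not shown, the
filtered signals are replaced by random placeholders, and the function name
`covariance_tensor` and the trial length of 300 samples are hypothetical.

```python
import numpy as np

def covariance_tensor(filtered_signals):
    """Stack one channel x channel matrix per band-pass filter.

    filtered_signals: array of shape (Z, C, T), one filtered trial per filter.
    """
    Z, C, T = filtered_signals.shape
    centering = np.eye(T) - np.ones((T, T)) / T        # I_T - (1/T) 1 1^T
    slices = []
    for S_i in filtered_signals:
        S_tilde = S_i @ centering / np.sqrt(T - 1)     # centering and scaling
        slices.append(S_tilde @ S_tilde.T)             # X_i = S_tilde S_tilde^T
    return np.stack(slices)                            # shape (Z, C, C)

rng = np.random.default_rng(0)
signals = rng.standard_normal((5, 49, 300))  # placeholder for filtered EEG signals
X = covariance_tensor(signals)
print(X.shape)
```

Under this assumption each slice `X[i]` coincides with `np.cov(signals[i])`, which makes
the symmetry of the slices explicit.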

For our experiments, we used $Z = 5$ different band-pass Butterworth filters with cutoff
frequencies of (7, 10), (9, 12), (11, 14), (13, 16), and (15, 18) with scaling by 50, which
resulted in a signal converted into a tensor of dimensions $5 \times 49 \times 49$. We split
the data used in the competition into training and validation sets with a proportion of
80 : 20, and the rest of the data are used for testing. As in the previous experiment, we
used logistic regression with all the tensor norms, individual mode unfolded matrix
regularizations, and cross validation with unfolded matrix regularization. We also used
vector-based logistic regression with $\ell_2$-regularization for comparison. To compare
tensor-based methods with the previously proposed matrix approach (Tomioka and Aihara,
2007), we averaged the tensor data over the frequency mode and applied classification with
matrix trace norm regularization. For all experiments, we selected the regularization
parameters from 100 values spaced on a logarithmic scale from 0.01 to 500.
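A logarithmic grid of this kind, together with validation-based selection, can be sketched
as follows. The function names `log_grid` and `select_lambda` are hypothetical, and the toy
`validation_error` function merely stands in for the validation error of a trained
classifier.

```python
import math

def log_grid(low, high, num):
    """Return `num` values evenly spaced on a logarithmic scale in [low, high]."""
    step = (math.log10(high) - math.log10(low)) / (num - 1)
    return [10 ** (math.log10(low) + i * step) for i in range(num)]

def select_lambda(candidates, validation_error):
    """Pick the candidate with the smallest validation error."""
    return min(candidates, key=validation_error)

lambdas = log_grid(0.01, 500.0, 100)

def validation_error(lam):
    # Toy stand-in: pretend the error is smallest near lambda = 10 ** 0.5.
    return (math.log10(lam) - 0.5) ** 2

best_lambda = select_lambda(lambdas, validation_error)
print(len(lambdas), lambdas[0], lambdas[-1], best_lambda)
```

Consecutive candidates differ by a constant factor, so small and large regularization
strengths are covered with the same relative resolution.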

The results of the experiment are given in Table 2, which indicates that vector-based
logistic regression is clearly outperformed by the overlapped and scaled latent trace
norms. Also, in most cases, the averaged matrix method performs poorly compared to
the optimal tensor structured regularization methods. Mode-1 regularization performs
poorly since mode-1 was high ranked compared to the other modes. Similarly, the latent
trace norm gives poor performance because it cannot regularize properly, as it does not
consider the rank relative to the mode dimension. For all subjects, mode-2 and mode-3
unfolded regularizations result in the same performance due to the symmetry of each
$X_i$, which results in the same rank along the mode-2 and mode-3 unfoldings. For subject
aa, the scaled latent trace norm, mode-2, mode-3, and mode-wise cross validation give the
best or comparable performance. For subject al, all classifiers except the latent trace
norm and mode-1 regularization give comparable performance. For all other subjects except
for aa and al, the overlapped trace norm gives the best performance.
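The effect of the symmetry on the mode-2 and mode-3 unfoldings can be checked numerically.
The sketch below uses random symmetric slices as placeholders for the matrices $X_i$; the
helper name `unfold` and the array sizes are assumptions for illustration.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-k unfolding: rows index the chosen mode, columns all other modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 49, 60))
# Each slice X[z] = A[z] A[z]^T is a symmetric 49 x 49 matrix.
X = np.einsum("zct,zdt->zcd", A, A)

# Singular values of the unfoldings along the two channel modes.
sv_mode2 = np.linalg.svd(unfold(X, 1), compute_uv=False)
sv_mode3 = np.linalg.svd(unfold(X, 2), compute_uv=False)
same_spectrum = bool(np.allclose(sv_mode2, sv_mode3))
print(same_spectrum)
```

Because every slice is symmetric, the two unfoldings are the same matrix, so they share
their singular values and hence their rank.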

In contrast to the computation time for the regression experiments, in this experiment
we see that the tensor trace norm regularizations are computationally more expensive
than the mode-wise regularizations. Also, the mode-wise cross validation is
computationally less expensive than the scaled latent trace norm and other tensor trace
norms. This is a slight drawback of the tensor norms, though they tend to have higher
classification accuracy.


6 Conclusion and Future Work 

In this paper, we have studied tensor-based regression and classification with
regularization using the overlapped trace norm, the latent trace norm, and the scaled latent
trace norm. We have provided dual optimization methods, theoretical analysis, and
experimental evaluations to understand tensor-based inductive learning. Our theoretical
analysis on excess risk bounds showed the relationship of excess risks with the multilinear
ranks and dimensions of the weight tensor. Our experimental results on both simulated and
real data sets further confirmed the validity of our theoretical analyses. From the
theoretical and empirical results, we can conclude that the performance of regularization
with tensor norms depends on the multilinear ranks and mode dimensions, where the latent
and scaled latent norms are more robust in tensors with large variations of multilinear
ranks.

Our research opens up many future research directions. For example, an important
direction is the improvement of optimization methods. Optimization over the latent
tensors required by the latent trace norm and the scaled latent trace norm
increases the computational cost compared to the vectorized methods. Also, computing
multiple singular value decompositions and solving Newton optimization sub-problems
(for logistic regression) at each iterative step are computationally expensive. This is
evident from our experimental results on computation time for regression and classification.
It would be an important direction to develop computationally more efficient methods
for learning with tensor data to make it more practical.

Table 2: Classification error of experiments with the BCI competition IVa data set. The
boldfaced figures in the columns aa, al, av, aw, and ay indicate comparable accuracies among
classifiers after a t-test with significance of 0.05.

| Norm | Subject aa | Subject al | Subject av | Subject aw | Subject ay | Computation time |
|---|---|---|---|---|---|---|
| Overlapped Trace Norm | 0.2205 (0.0139) | **0.0178 (0.0)** | **0.3244 (0.0132)** | **0.0603 (0.0071)** | **0.1254 (0.0190)** | 17986 (1489) |
| Latent Trace Norm | 0.3107 (0.0210) | 0.0339 (0.0056) | 0.3735 (0.0218) | 0.1549 (0.0381) | 0.4008 (0.0) | 20021 (14024) |
| Scaled Latent Trace Norm | **0.2080 (0.0043)** | **0.0179 (0.0)** | 0.3694 (0.0182) | 0.0804 (0.0) | 0.1980 (0.0476) | 77123 (149024) |
| Mode-1 | 0.3205 (0.0174) | 0.0339 (0.0056) | 0.3739 (0.0211) | 0.1450 (0.0070) | 0.4020 (0.0038) | 5737 (3238) |
| Mode-2 | **0.2035 (0.0124)** | **0.0285 (0.0225)** | 0.3653 (0.0186) | 0.0790 (0.0042) | 0.1794 (0.0025) | 5195 (1446) |
| Mode-3 | **0.2035 (0.0124)** | **0.0285 (0.0225)** | 0.3653 (0.0186) | 0.0790 (0.0042) | 0.1794 (0.0025) | 5223 (1452) |
| Mode-wise CV | **0.2080 (0.0369)** | 0.0428 (0.0305) | 0.3545 (0.01255) | 0.1008 (0.0227) | **0.1452 (0.0224)** | 14473 (4142) |
| Averaged Matrix | 0.2732 (0.0286) | **0.0178 (0.000)** | 0.4030 (0.2487) | 0.1366 (0.0056) | 0.1825 (0.0) | 1936 (472) |
| Logistic regression ($\ell_2$) | 0.3161 (0.0075) | **0.0179 (0.0)** | 0.3684 (0.0537) | 0.2241 (0.0432) | 0.4040 (0.0640) | 72 (62) |

Regularization with a mixture of norms is common in both vector-based (e.g., the elastic
net (Zou and Hastie, 2003)) and matrix-based regularizations (Savalle et al., 2012). It
would be an interesting research direction to combine sparse regularization (the
$\ell_1$-norm) with existing tensor norms. There is also a recent research direction to
develop new composite norms such as the $(k, q)$-trace norm (Richard et al., 2014).
Development of composite tensor norms can be useful for inductive tensor learning to obtain
sparse and low-rank solutions.
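Appendix A

In this appendix, we derive the dual formulation of the latent trace norms. Let us consider
a training data set $(\mathcal{X}_i, y_i)$, $i = 1, \ldots, m$, where
$\mathcal{X}_i \in \mathbb{R}^{n_1 \times \cdots \times n_K}$. To derive the dual for
the latent trace norms, we rewrite the primal for the regression of (5) as

$$
\min_{\mathcal{W}^{(1)}, \ldots, \mathcal{W}^{(K)},\, b,\, z_1, \ldots, z_m} \;
\sum_{i=1}^{m} \frac{1}{2} (y_i - z_i)^2 + \sum_{k=1}^{K} \lambda_k \left\| W^{(k)}_{(k)} \right\|_{\mathrm{tr}}
$$

$$
\text{subject to} \quad z_i = \Big\langle \sum_{k=1}^{K} \mathcal{W}^{(k)}, \mathcal{X}_i \Big\rangle + b,
\quad i = 1, \ldots, m,
$$

where $\lambda_k = \lambda$ for the latent trace norm and $\lambda_k = \lambda / \sqrt{n_k}$
for the scaled latent trace norm.

Its Lagrangian can be written by introducing variables $\alpha_i \in \mathbb{R}$,
$i = 1, \ldots, m$, as

$$
\begin{aligned}
G(\alpha) &= \min_{\mathcal{W}^{(1)}, \ldots, \mathcal{W}^{(K)},\, b,\, z_1, \ldots, z_m}
\sum_{i=1}^{m} \frac{1}{2} (y_i - z_i)^2 + \sum_{k=1}^{K} \lambda_k \left\| W^{(k)}_{(k)} \right\|_{\mathrm{tr}}
+ \sum_{i=1}^{m} \alpha_i \Big( z_i - \Big\langle \sum_{k=1}^{K} \mathcal{W}^{(k)}, \mathcal{X}_i \Big\rangle - b \Big) \\
&= \min_{z_1, \ldots, z_m} \sum_{i=1}^{m} \Big( \frac{1}{2} (y_i - z_i)^2 + \alpha_i z_i \Big)
+ \min_{b} \Big( -b \sum_{i=1}^{m} \alpha_i \Big)
+ \sum_{k=1}^{K} \min_{\mathcal{W}^{(k)}} \Big( \lambda_k \left\| W^{(k)}_{(k)} \right\|_{\mathrm{tr}}
- \Big\langle \mathcal{W}^{(k)}, \sum_{i=1}^{m} \alpha_i \mathcal{X}_i \Big\rangle \Big) \\
&= \sum_{i=1}^{m} \Big( -\frac{1}{2} \alpha_i^2 + \alpha_i y_i \Big)
+ \sum_{k=1}^{K}
\begin{cases}
0 & \text{if } \big\| \sum_{i=1}^{m} \alpha_i X_{i(k)} \big\|_{\mathrm{op}} \le \lambda_k, \\
-\infty & \text{otherwise}
\end{cases}
+
\begin{cases}
0 & \text{if } \sum_{i=1}^{m} \alpha_i = 0, \\
-\infty & \text{otherwise.}
\end{cases}
\end{aligned}
$$

Let us introduce auxiliary variables $\mathcal{V}^{(1)}, \ldots, \mathcal{V}^{(K)}$ to remove
the coupling between the indicator functions. Since maximizing $G(\alpha)$ is equivalent to
minimizing $-G(\alpha)$, the above dual problem can be restated as

$$
\min_{\alpha,\, \mathcal{V}^{(1)}, \ldots, \mathcal{V}^{(K)}} \;
\sum_{i=1}^{m} \Big( \frac{1}{2} \alpha_i^2 - \alpha_i y_i \Big)
+ \sum_{k=1}^{K} \delta_{\lambda_k}\big( V^{(k)}_{(k)} \big)
$$

$$
\text{subject to} \quad \mathcal{V}^{(k)} = \sum_{i=1}^{m} \alpha_i \mathcal{X}_i, \quad k = 1, \ldots, K,
\qquad \sum_{i=1}^{m} \alpha_i = 0, \qquad (18)
$$

where $\delta_{\lambda_k}(V)$ equals $0$ if $\|V\|_{\mathrm{op}} \le \lambda_k$ and $+\infty$
otherwise.

Similarly, we can derive the dual formulation for logistic regression.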


Appendix B 


In this appendix, we prove the theoretical results in Section 4. 

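Proof of Lemma 1: By using the same approach as the one given in Wimalawarne et al. (2014)
and Maurer and Pontil (2013), we rewrite

$$
R(\hat{\mathcal{W}}) - R(\mathcal{W}^0)
= [R(\hat{\mathcal{W}}) - \hat{R}(\hat{\mathcal{W}})]
+ [\hat{R}(\hat{\mathcal{W}}) - \hat{R}(\mathcal{W}^0)]
+ [\hat{R}(\mathcal{W}^0) - R(\mathcal{W}^0)].
$$

The second term is never positive and, based on Hoeffding's inequality, with probability
$1 - \delta/2$, the third term can be bounded by $\sqrt{\log(2/\delta)/(2m)}$. Hence,

$$
\begin{aligned}
R(\hat{\mathcal{W}}) - R(\mathcal{W}^0)
&\le R(\hat{\mathcal{W}}) - \hat{R}(\hat{\mathcal{W}}) + \sqrt{\frac{\log(2/\delta)}{2m}} \\
&\le \sup_{\|\mathcal{W}\|_{\star} \le B_0} \big( R(\mathcal{W}) - \hat{R}(\mathcal{W}) \big)
+ \sqrt{\frac{\log(2/\delta)}{2m}}.
\end{aligned}
$$

Further applying McDiarmid's inequality, with probability at least $1 - \delta$, we get the
following Rademacher complexity:

$$
\frac{2}{m} \mathbb{E}_{\sigma} \sup_{\|\mathcal{W}\|_{\star} \le B_0}
\sum_{i=1}^{m} \sigma_i \, l(\langle \mathcal{W}, \mathcal{X}_i \rangle, y_i),
$$

where $\sigma_i \in \{-1, 1\}$ are Rademacher variables, which leads to

$$
\begin{aligned}
R(\hat{\mathcal{W}}) - R(\mathcal{W}^0)
&\le \frac{2}{m} \mathbb{E}_{\sigma} \sup_{\|\mathcal{W}\|_{\star} \le B_0}
\sum_{i=1}^{m} \sigma_i \, l(\langle \mathcal{W}, \mathcal{X}_i \rangle, y_i)
+ \sqrt{\frac{\log(2/\delta)}{2m}} \\
&\le \frac{2\Lambda}{m} \mathbb{E}_{\sigma} \sup_{\|\mathcal{W}\|_{\star} \le B_0}
\sum_{i=1}^{m} \sigma_i \langle \mathcal{W}, \mathcal{X}_i \rangle
+ \sqrt{\frac{\log(2/\delta)}{2m}} \\
&= \frac{2\Lambda}{m} \mathbb{E}_{\sigma} \sup_{\|\mathcal{W}\|_{\star} \le B_0}
\langle \mathcal{W}, \mathcal{M} \rangle
+ \sqrt{\frac{\log(2/\delta)}{2m}},
\qquad \mathcal{M} = \sum_{i=1}^{m} \sigma_i \mathcal{X}_i, \\
&\le \frac{2\Lambda}{m} \mathbb{E}_{\sigma} \sup_{\|\mathcal{W}\|_{\star} \le B_0}
\|\mathcal{W}\|_{\star} \|\mathcal{M}\|_{\star^*}
+ \sqrt{\frac{\log(2/\delta)}{2m}}
\qquad \text{(Hölder's inequality)} \\
&\le \frac{2\Lambda B_0}{m} \mathbb{E} \|\mathcal{M}\|_{\star^*}
+ \sqrt{\frac{\log(2/\delta)}{2m}}.
\end{aligned}
$$

□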

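Proof of Theorem 1: First we bound the data-dependent component
$\mathbb{E}\|\mathcal{M}\|_{\mathrm{overlap}^*}$. For this, we use the following duality
relationship borrowed from Tomioka and Suzuki (2013):

$$
\|\mathcal{M}\|_{\mathrm{overlap}^*}
= \inf_{\mathcal{M}^{(1)} + \cdots + \mathcal{M}^{(K)} = \mathcal{M}} \; \max_{k}
\big\| M^{(k)}_{(k)} \big\|_{\mathrm{op}}.
$$

Since we can take any $\mathcal{M}^{(k)}$ to equal $\mathcal{M}$, the above norm can be upper
bounded as

$$
\|\mathcal{M}\|_{\mathrm{overlap}^*} \le \min_{k} \|M_{(k)}\|_{\mathrm{op}}.
$$

Furthermore, the expectation of the minimum over $k$ can be upper-bounded by the minimum
of the expectation:

$$
\mathbb{E}\|\mathcal{M}\|_{\mathrm{overlap}^*}
\le \mathbb{E} \min_{k} \|M_{(k)}\|_{\mathrm{op}}
\le \min_{k} \mathbb{E} \|M_{(k)}\|_{\mathrm{op}}. \qquad (19)
$$

Let $\sigma = \{\sigma_1, \ldots, \sigma_m\}$ be fixed Rademacher variables. Since each
$\mathcal{X}_i$ contains elements following the standard Gaussian distribution, each element
in $\mathcal{M}$ is a sample from $\mathcal{N}(0, \|\sigma\|_2^2)$. Based on the standard
methods used in Tomioka et al. (2011b), we can express $\|M_{(k)}\|_{\mathrm{op}}$ as

$$
\|M_{(k)}\|_{\mathrm{op}}
= \sup_{u \in S^{n_k - 1},\; v \in S^{\prod_{i \ne k} n_i - 1}} u^\top M_{(k)} v.
$$

Using Gordon's theorem as in Tomioka et al. (2011b), we have

$$
\min_{k} \mathbb{E}\|M_{(k)}\|_{\mathrm{op}}
\le \|\sigma\|_2 \min_{k} \big( \sqrt{n_k} + \sqrt{n_{\setminus k}} \big). \qquad (20)
$$

Next, taking the expectation over $\sigma$, we have

$$
\mathbb{E}\|\sigma\|_2 \le \sqrt{\mathbb{E}_{\sigma} \|\sigma\|_2^2} = \sqrt{m}. \qquad (21)
$$

Combining (20) and (21) with (19) results in

$$
\mathbb{E}\|\mathcal{M}\|_{\mathrm{overlap}^*}
\le \min_{k} \sqrt{m} \big( \sqrt{n_k} + \sqrt{n_{\setminus k}} \big).
$$

Finally, the excess risk can be written as

$$
R(\hat{\mathcal{W}}) - R(\mathcal{W}^0)
\le c_1 \Lambda \frac{B}{\sqrt{m}} \Big( \sum_{k=1}^{K} \sqrt{r_k} \Big)
\min_{k} \big( \sqrt{n_k} + \sqrt{n_{\setminus k}} \big)
+ c_2 \sqrt{\frac{\log(2/\delta)}{2m}}.
$$

□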
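The two probabilistic ingredients of this proof can be checked with a small Monte Carlo
experiment. The sketch below is an illustration under assumptions, not part of the proof:
it takes the mode dimensions (4, 10, 10) from the toy experiments, uses a hypothetical
number of 200 trials, and compares the average operator norm of a standard Gaussian matrix
of the size of each mode-$k$ unfolding with $\sqrt{n_k} + \sqrt{n_{\setminus k}}$.

```python
import numpy as np

def mean_operator_norm(n_rows, n_cols, trials, rng):
    """Monte Carlo estimate of E||G||_op for a standard Gaussian matrix G."""
    norms = [np.linalg.norm(rng.standard_normal((n_rows, n_cols)), 2)
             for _ in range(trials)]
    return float(np.mean(norms))

rng = np.random.default_rng(0)
dims = (4, 10, 10)
N = int(np.prod(dims))

results = []
for n_k in dims:
    n_rest = N // n_k  # number of columns of the mode-k unfolding
    estimate = mean_operator_norm(n_k, n_rest, trials=200, rng=rng)
    bound = float(np.sqrt(n_k) + np.sqrt(n_rest))
    results.append((estimate, bound))
    print(n_k, n_rest, round(estimate, 3), round(bound, 3))

# A Rademacher vector of length m always has Euclidean norm sqrt(m).
m = 50
sigma = rng.choice([-1.0, 1.0], size=m)
sigma_norm = float(np.linalg.norm(sigma))
print(sigma_norm, np.sqrt(m))
```

The estimated averages stay below the corresponding bounds, and the norm of the Rademacher
vector equals $\sqrt{m}$ exactly, so the inequality $\mathbb{E}\|\sigma\|_2 \le \sqrt{m}$
holds with equality.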

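Proof of Theorem 2: To bound the data-dependent component, we use the duality
result given in Tomioka and Suzuki (2013):

$$
\|\mathcal{M}\|_{\mathrm{latent}^*} = \max_{k} \|M_{(k)}\|_{\mathrm{op}}.
$$

Since $\mathcal{M}$ consists of elements following the Gaussian distribution, for each
mode-$k$ unfolding, we can write a tail bound (Tomioka and Suzuki, 2013) as

$$
P\Big( \|M_{(k)}\|_{\mathrm{op}} \ge \|\sigma\|_2 \big( \sqrt{n_k} + \sqrt{n_{\setminus k}} \big) + t \Big)
\le \exp\big( -t^2 / (2\|\sigma\|_2^2) \big).
$$

Using a union bound, we have

$$
P\Big( \max_{k} \|M_{(k)}\|_{\mathrm{op}} \ge \|\sigma\|_2 \max_{k} \big( \sqrt{n_k} + \sqrt{n_{\setminus k}} \big) + t \Big)
\le K \exp\big( -t^2 / (2\|\sigma\|_2^2) \big),
$$

and this results in

$$
\mathbb{E} \max_{k} \|M_{(k)}\|_{\mathrm{op}}
\le \|\sigma\|_2 \max_{k} \big( \sqrt{n_k} + \sqrt{n_{\setminus k}} \big)
+ \|\sigma\|_2 \, C \sqrt{2 \log(K)},
$$

where $C$ is a constant. Similarly to (21), taking the expectation over $\sigma$, we arrive at

$$
\mathbb{E} \max_{k} \|M_{(k)}\|_{\mathrm{op}}
\le \sqrt{m} \max_{k} \big( \sqrt{n_k} + \sqrt{n_{\setminus k}} \big)
+ \sqrt{m} \, C \sqrt{2 \log(K)},
$$

where $C$ is a constant. Finally, the excess risk is given as

$$
R(\hat{\mathcal{W}}) - R(\mathcal{W}^0)
\le c_1 \Lambda B \sqrt{\frac{\min_{k} r_k}{m}}
\Big( \max_{k} \big( \sqrt{n_k} + \sqrt{n_{\setminus k}} \big) + C \sqrt{2 \log(K)} \Big)
+ c_2 \sqrt{\frac{\log(2/\delta)}{2m}}.
$$

□

Proof of Theorem 3: From Tomioka and Suzuki (2013), we have

$$
\|\mathcal{M}\|_{\mathrm{scaled}^*} = \max_{k} \sqrt{n_k} \, \|M_{(k)}\|_{\mathrm{op}}.
$$

Using a similar approach to the latent trace norm with the additional scaling of
$\sqrt{n_k}$, we arrive at the following excess risk bound for the scaled latent trace norm:

$$
R(\hat{\mathcal{W}}) - R(\mathcal{W}^0)
\le c_1 \Lambda B \sqrt{\frac{1}{m} \min_{k} \Big( \frac{r_k}{n_k} \Big)}
\Big( \max_{k} \big( n_k + \sqrt{N} \big) + C \sqrt{2 \log(K)} \Big)
+ c_2 \sqrt{\frac{\log(2/\delta)}{2m}}.
$$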